{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "b9bba344bbe0b4bd",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "# AI21SemanticTextSplitter\n",
    "\n",
    "This notebook shows how to use `AI21SemanticTextSplitter` in LangChain to split text into chunks based on semantic meaning."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d8e4cdb63fbc34ec",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## Installation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b09bb1cd2c7e036a",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%pip install langchain-ai21"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba1d80fe8d82be89",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## Environment Setup\n",
    "\n",
    "We'll need to get an AI21 API key and set the `AI21_API_KEY` environment variable:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "844b8f744d22bcb6",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import os\n",
    "from getpass import getpass\n",
    "\n",
    "os.environ[\"AI21_API_KEY\"] = getpass()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3e670b278e6b2b9e",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## Example Usage"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f61c5c981f01ad31",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "### Splitting text by semantic meaning"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e7da988112712cf3",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "This example shows how to use AI21SemanticTextSplitter to split a text into chunks based on semantic meaning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1d82b65c9b8684f3",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from langchain_ai21 import AI21SemanticTextSplitter\n",
    "\n",
    "TEXT = (\n",
    "    \"We’ve all experienced reading long, tedious, and boring pieces of text - financial reports, \"\n",
    "    \"legal documents, or terms and conditions (though, who actually reads those terms and conditions to be honest?).\\n\"\n",
    "    \"Imagine a company that employs hundreds of thousands of employees. In today's information \"\n",
    "    \"overload age, nearly 30% of the workday is spent dealing with documents. There's no surprise \"\n",
    "    \"here, given that some of these documents are long and convoluted on purpose (did you know that \"\n",
    "    \"reading through all your privacy policies would take almost a quarter of a year?). Aside from \"\n",
    "    \"inefficiency, workers may simply refrain from reading some documents (for example, Only 16% of \"\n",
    "    \"Employees Read Their Employment Contracts Entirely Before Signing!).\\nThis is where AI-driven summarization \"\n",
    "    \"tools can be helpful: instead of reading entire documents, which is tedious and time-consuming, \"\n",
    "    \"users can (ideally) quickly extract relevant information from a text. With large language models, \"\n",
    "    \"the development of those tools is easier than ever, and you can offer your users a summary that is \"\n",
    "    \"specifically tailored to their preferences.\\nLarge language models naturally follow patterns in input \"\n",
    "    \"(prompt), and provide coherent completion that follows the same patterns. For that, we want to feed \"\n",
    "    'them with several examples in the input (\"few-shot prompt\"), so they can follow through. '\n",
    "    \"The process of creating the correct prompt for your problem is called prompt engineering, \"\n",
    "    \"and you can read more about it here.\"\n",
    ")\n",
    "\n",
    "semantic_text_splitter = AI21SemanticTextSplitter()\n",
    "chunks = semantic_text_splitter.split_text(TEXT)\n",
    "\n",
    "print(f\"The text has been split into {len(chunks)} chunks.\")\n",
    "for chunk in chunks:\n",
    "    print(chunk)\n",
    "    print(\"====\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2e8d1fcf818a8a81",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "### Splitting text by semantic meaning with merge"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c307abbc216fe89f",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "This example shows how to use AI21SemanticTextSplitter to split a text into chunks based on semantic meaning and then merge those chunks based on `chunk_size`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5651c581fcc1ff02",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from langchain_ai21 import AI21SemanticTextSplitter\n",
    "\n",
    "TEXT = (\n",
    "    \"We’ve all experienced reading long, tedious, and boring pieces of text - financial reports, \"\n",
    "    \"legal documents, or terms and conditions (though, who actually reads those terms and conditions to be honest?).\\n\"\n",
    "    \"Imagine a company that employs hundreds of thousands of employees. In today's information \"\n",
    "    \"overload age, nearly 30% of the workday is spent dealing with documents. There's no surprise \"\n",
    "    \"here, given that some of these documents are long and convoluted on purpose (did you know that \"\n",
    "    \"reading through all your privacy policies would take almost a quarter of a year?). Aside from \"\n",
    "    \"inefficiency, workers may simply refrain from reading some documents (for example, Only 16% of \"\n",
    "    \"Employees Read Their Employment Contracts Entirely Before Signing!).\\nThis is where AI-driven summarization \"\n",
    "    \"tools can be helpful: instead of reading entire documents, which is tedious and time-consuming, \"\n",
    "    \"users can (ideally) quickly extract relevant information from a text. With large language models, \"\n",
    "    \"the development of those tools is easier than ever, and you can offer your users a summary that is \"\n",
    "    \"specifically tailored to their preferences.\\nLarge language models naturally follow patterns in input \"\n",
    "    \"(prompt), and provide coherent completion that follows the same patterns. For that, we want to feed \"\n",
    "    'them with several examples in the input (\"few-shot prompt\"), so they can follow through. '\n",
    "    \"The process of creating the correct prompt for your problem is called prompt engineering, \"\n",
    "    \"and you can read more about it here.\"\n",
    ")\n",
    "\n",
    "semantic_text_splitter_chunks = AI21SemanticTextSplitter(chunk_size=1000)\n",
    "chunks = semantic_text_splitter_chunks.split_text(TEXT)\n",
    "\n",
    "print(f\"The text has been split into {len(chunks)} chunks.\")\n",
    "for chunk in chunks:\n",
    "    print(chunk)\n",
    "    print(\"====\")"
   ]
  },
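  {
   "cell_type": "markdown",
   "id": "7c1e5a90b3d24f68",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "To build intuition for the merge step, here is a minimal, self-contained sketch of one way consecutive segments could be packed greedily into chunks of at most `chunk_size` characters. The helper `merge_segments` and its `joiner` parameter are hypothetical and are not part of `langchain-ai21`; the library's actual merge logic may differ, for example in how it joins segments or handles a segment longer than `chunk_size`.\n",
    "\n",
    "```python\n",
    "from typing import List\n",
    "\n",
    "\n",
    "def merge_segments(segments: List[str], chunk_size: int, joiner: str = ' ') -> List[str]:\n",
    "    # Hypothetical helper (an assumption, not the library's implementation):\n",
    "    # greedily pack consecutive segments into chunks within chunk_size.\n",
    "    chunks: List[str] = []\n",
    "    current = ''\n",
    "    for segment in segments:\n",
    "        candidate = segment if not current else current + joiner + segment\n",
    "        if current and len(candidate) > chunk_size:\n",
    "            # Adding this segment would overflow, so close the current chunk.\n",
    "            chunks.append(current)\n",
    "            current = segment\n",
    "        else:\n",
    "            current = candidate\n",
    "    if current:\n",
    "        chunks.append(current)\n",
    "    return chunks\n",
    "\n",
    "\n",
    "segments = ['aaaa', 'bbbb', 'cccc', 'dddddddddddd']\n",
    "print(merge_segments(segments, chunk_size=10))\n",
    "```\n",
    "\n",
    "In this sketch a segment that is longer than `chunk_size` on its own is kept whole rather than cut, so semantic boundaries are never broken mid-segment."
   ]
  },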
  {
   "cell_type": "markdown",
   "id": "b464db855e547cbb",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "### Splitting text to documents"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4410e8467012b193",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "This example shows how to use AI21SemanticTextSplitter to split a text into Documents based on semantic meaning. Each Document's metadata contains a `source_type` entry describing the type of that segment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3cf131d9be910115",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from langchain_ai21 import AI21SemanticTextSplitter\n",
    "\n",
    "TEXT = (\n",
    "    \"We’ve all experienced reading long, tedious, and boring pieces of text - financial reports, \"\n",
    "    \"legal documents, or terms and conditions (though, who actually reads those terms and conditions to be honest?).\\n\"\n",
    "    \"Imagine a company that employs hundreds of thousands of employees. In today's information \"\n",
    "    \"overload age, nearly 30% of the workday is spent dealing with documents. There's no surprise \"\n",
    "    \"here, given that some of these documents are long and convoluted on purpose (did you know that \"\n",
    "    \"reading through all your privacy policies would take almost a quarter of a year?). Aside from \"\n",
    "    \"inefficiency, workers may simply refrain from reading some documents (for example, Only 16% of \"\n",
    "    \"Employees Read Their Employment Contracts Entirely Before Signing!).\\nThis is where AI-driven summarization \"\n",
    "    \"tools can be helpful: instead of reading entire documents, which is tedious and time-consuming, \"\n",
    "    \"users can (ideally) quickly extract relevant information from a text. With large language models, \"\n",
    "    \"the development of those tools is easier than ever, and you can offer your users a summary that is \"\n",
    "    \"specifically tailored to their preferences.\\nLarge language models naturally follow patterns in input \"\n",
    "    \"(prompt), and provide coherent completion that follows the same patterns. For that, we want to feed \"\n",
    "    'them with several examples in the input (\"few-shot prompt\"), so they can follow through. '\n",
    "    \"The process of creating the correct prompt for your problem is called prompt engineering, \"\n",
    "    \"and you can read more about it here.\"\n",
    ")\n",
    "\n",
    "semantic_text_splitter = AI21SemanticTextSplitter()\n",
    "documents = semantic_text_splitter.split_text_to_documents(TEXT)\n",
    "\n",
    "print(f\"The text has been split into {len(documents)} Documents.\")\n",
    "for doc in documents:\n",
    "    print(f\"type: {doc.metadata['source_type']}\")\n",
    "    print(f\"text: {doc.page_content}\")\n",
    "    print(\"====\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b544ba21335d01a6",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "### Creating Documents with Metadata"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c67f8c3ad89b8ad2",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "This example shows how to use AI21SemanticTextSplitter to create Documents from texts and add custom metadata to each Document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fe222d0e85249bda",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from langchain_ai21 import AI21SemanticTextSplitter\n",
    "\n",
    "TEXT = (\n",
    "    \"We’ve all experienced reading long, tedious, and boring pieces of text - financial reports, \"\n",
    "    \"legal documents, or terms and conditions (though, who actually reads those terms and conditions to be honest?).\\n\"\n",
    "    \"Imagine a company that employs hundreds of thousands of employees. In today's information \"\n",
    "    \"overload age, nearly 30% of the workday is spent dealing with documents. There's no surprise \"\n",
    "    \"here, given that some of these documents are long and convoluted on purpose (did you know that \"\n",
    "    \"reading through all your privacy policies would take almost a quarter of a year?). Aside from \"\n",
    "    \"inefficiency, workers may simply refrain from reading some documents (for example, Only 16% of \"\n",
    "    \"Employees Read Their Employment Contracts Entirely Before Signing!).\\nThis is where AI-driven summarization \"\n",
    "    \"tools can be helpful: instead of reading entire documents, which is tedious and time-consuming, \"\n",
    "    \"users can (ideally) quickly extract relevant information from a text. With large language models, \"\n",
    "    \"the development of those tools is easier than ever, and you can offer your users a summary that is \"\n",
    "    \"specifically tailored to their preferences.\\nLarge language models naturally follow patterns in input \"\n",
    "    \"(prompt), and provide coherent completion that follows the same patterns. For that, we want to feed \"\n",
    "    'them with several examples in the input (\"few-shot prompt\"), so they can follow through. '\n",
    "    \"The process of creating the correct prompt for your problem is called prompt engineering, \"\n",
    "    \"and you can read more about it here.\"\n",
    ")\n",
    "\n",
    "semantic_text_splitter = AI21SemanticTextSplitter()\n",
    "texts = [TEXT]\n",
    "documents = semantic_text_splitter.create_documents(\n",
    "    texts=texts, metadatas=[{\"pikachu\": \"pika pika\"}]\n",
    ")\n",
    "\n",
    "print(f\"The text has been split into {len(documents)} Documents.\")\n",
    "for doc in documents:\n",
    "    print(f\"metadata: {doc.metadata}\")\n",
    "    print(f\"text: {doc.page_content}\")\n",
    "    print(\"====\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f8b5682c34142319",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "### Splitting text to documents with start index"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "359ea797c03ece85",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "This example shows how to use AI21SemanticTextSplitter to split a text into Documents based on semantic meaning. The metadata will contain a start index for each document.\n",
    "**Note** that the start index provides an indication of the order of the chunks rather than the actual start index for each chunk."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2dc39002f0c25784",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from langchain_ai21 import AI21SemanticTextSplitter\n",
    "\n",
    "TEXT = (\n",
    "    \"We’ve all experienced reading long, tedious, and boring pieces of text - financial reports, \"\n",
    "    \"legal documents, or terms and conditions (though, who actually reads those terms and conditions to be honest?).\\n\"\n",
    "    \"Imagine a company that employs hundreds of thousands of employees. In today's information \"\n",
    "    \"overload age, nearly 30% of the workday is spent dealing with documents. There's no surprise \"\n",
    "    \"here, given that some of these documents are long and convoluted on purpose (did you know that \"\n",
    "    \"reading through all your privacy policies would take almost a quarter of a year?). Aside from \"\n",
    "    \"inefficiency, workers may simply refrain from reading some documents (for example, Only 16% of \"\n",
    "    \"Employees Read Their Employment Contracts Entirely Before Signing!).\\nThis is where AI-driven summarization \"\n",
    "    \"tools can be helpful: instead of reading entire documents, which is tedious and time-consuming, \"\n",
    "    \"users can (ideally) quickly extract relevant information from a text. With large language models, \"\n",
    "    \"the development of those tools is easier than ever, and you can offer your users a summary that is \"\n",
    "    \"specifically tailored to their preferences.\\nLarge language models naturally follow patterns in input \"\n",
    "    \"(prompt), and provide coherent completion that follows the same patterns. For that, we want to feed \"\n",
    "    'them with several examples in the input (\"few-shot prompt\"), so they can follow through. '\n",
    "    \"The process of creating the correct prompt for your problem is called prompt engineering, \"\n",
    "    \"and you can read more about it here.\"\n",
    ")\n",
    "\n",
    "semantic_text_splitter = AI21SemanticTextSplitter(add_start_index=True)\n",
    "documents = semantic_text_splitter.create_documents(texts=[TEXT])\n",
    "print(f\"The text has been split into {len(documents)} Documents.\")\n",
    "for doc in documents:\n",
    "    print(f\"start_index: {doc.metadata['start_index']}\")\n",
    "    print(f\"text: {doc.page_content}\")\n",
    "    print(\"====\")"
   ]
  },
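  {
   "cell_type": "markdown",
   "id": "4e9b2d17a6c05f83",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "If actual character offsets are needed, one option is to locate each chunk in the original text yourself. The sketch below uses a hypothetical helper, `locate_chunks`, which is not part of `langchain-ai21`. It assumes each chunk appears verbatim in the source text; if the splitter has normalized whitespace or otherwise altered a chunk, `str.find` returns `-1` for it.\n",
    "\n",
    "```python\n",
    "from typing import List, Tuple\n",
    "\n",
    "\n",
    "def locate_chunks(text: str, chunks: List[str]) -> List[Tuple[int, str]]:\n",
    "    # Hypothetical helper (an assumption, not a library API):\n",
    "    # pair each chunk with its character offset in text, or -1 if not found.\n",
    "    located = []\n",
    "    cursor = 0\n",
    "    for chunk in chunks:\n",
    "        offset = text.find(chunk, cursor)\n",
    "        if offset != -1:\n",
    "            # Keep searching after this chunk so repeated chunks map in order.\n",
    "            cursor = offset + len(chunk)\n",
    "        located.append((offset, chunk))\n",
    "    return located\n",
    "\n",
    "\n",
    "sample = 'First topic. Second topic. Third topic.'\n",
    "print(locate_chunks(sample, ['First topic.', 'Second topic.', 'Missing.']))\n",
    "```\n",
    "\n",
    "With the cells above, this could be called as `locate_chunks(TEXT, [doc.page_content for doc in documents])`."
   ]
  },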
  {
   "cell_type": "markdown",
   "id": "b62939cc5803b9fb",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "### Splitting documents"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "44162d340c0de5fb",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "This example shows how to use AI21SemanticTextSplitter to split a list of Documents into chunks based on semantic meaning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8950c8e4e1208bf6",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from langchain_ai21 import AI21SemanticTextSplitter\n",
    "from langchain_core.documents import Document\n",
    "\n",
    "TEXT = (\n",
    "    \"We’ve all experienced reading long, tedious, and boring pieces of text - financial reports, \"\n",
    "    \"legal documents, or terms and conditions (though, who actually reads those terms and conditions to be honest?).\\n\"\n",
    "    \"Imagine a company that employs hundreds of thousands of employees. In today's information \"\n",
    "    \"overload age, nearly 30% of the workday is spent dealing with documents. There's no surprise \"\n",
    "    \"here, given that some of these documents are long and convoluted on purpose (did you know that \"\n",
    "    \"reading through all your privacy policies would take almost a quarter of a year?). Aside from \"\n",
    "    \"inefficiency, workers may simply refrain from reading some documents (for example, Only 16% of \"\n",
    "    \"Employees Read Their Employment Contracts Entirely Before Signing!).\\nThis is where AI-driven summarization \"\n",
    "    \"tools can be helpful: instead of reading entire documents, which is tedious and time-consuming, \"\n",
    "    \"users can (ideally) quickly extract relevant information from a text. With large language models, \"\n",
    "    \"the development of those tools is easier than ever, and you can offer your users a summary that is \"\n",
    "    \"specifically tailored to their preferences.\\nLarge language models naturally follow patterns in input \"\n",
    "    \"(prompt), and provide coherent completion that follows the same patterns. For that, we want to feed \"\n",
    "    'them with several examples in the input (\"few-shot prompt\"), so they can follow through. '\n",
    "    \"The process of creating the correct prompt for your problem is called prompt engineering, \"\n",
    "    \"and you can read more about it here.\"\n",
    ")\n",
    "\n",
    "semantic_text_splitter = AI21SemanticTextSplitter()\n",
    "document = Document(page_content=TEXT, metadata={\"hello\": \"goodbye\"})\n",
    "documents = semantic_text_splitter.split_documents([document])\n",
    "print(f\"The document list has been split into {len(documents)} Documents.\")\n",
    "for doc in documents:\n",
    "    print(f\"text: {doc.page_content}\")\n",
    "    print(f\"metadata: {doc.metadata}\")\n",
    "    print(\"====\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
